Make AnsibleTower checkout/execute actions more resilient #341

Merged on Jan 15, 2025 (1 commit)


JacobCallahan
Member

We've been seeing service interruptions in AAP under high load. The jobs themselves complete successfully, but awxkit bails out when it hits the connection issue.
With this change, we simply enter a retry loop when monitoring job status.
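
For readers who want a picture of the idea, here is a minimal sketch of a retry loop around a blocking "wait for job" call. It is not the code from this PR: `wait_with_retries`, its parameters, and the use of the built-in `ConnectionError` are all assumptions made for illustration, and the real `resilient_job_wait` may differ in which exceptions it catches and how it waits between attempts.

```python
import time


def wait_with_retries(wait_fn, max_attempts=5, delay=1.0, retry_on=(ConnectionError,)):
    """Call wait_fn until it returns, retrying on connection-type errors.

    Hypothetical helper: wait_fn stands in for whatever call blocks until
    the job finishes. The last failure is re-raised once attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return wait_fn()
        except retry_on as err:
            if attempt == max_attempts:
                # Out of attempts: surface the original error to the caller.
                raise
            print(f"Connection issue ({err}); attempt {attempt} of {max_attempts} failed")
            time.sleep(delay)


# Usage: a stand-in wait call that fails twice before succeeding.
calls = {"n": 0}


def flaky_wait():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection reset")
    return "successful"


status = wait_with_retries(flaky_wait, delay=0)
```

The design point is that a dropped connection while polling says nothing about the job itself, so the loop only gives up on the monitoring call after repeated failures rather than on the first one.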

@JacobCallahan JacobCallahan added the enhancement New feature or request label Jan 13, 2025


Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

broker/providers/ansible_tower.py:622

  • The new retry logic introduced by resilient_job_wait should be covered by tests to ensure it works correctly.
resilient_job_wait(job)

pyproject.toml:131

  • [nitpick] Removing the TRY302 rule might allow exception handlers that re-raise errors immediately, which could lead to less clear error handling. Consider keeping this rule or ensuring that exception handlers are reviewed carefully.
"TRY302",  # Remove exception handler; error is immediately re-raised
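
As context for the nitpick, the rule's description ("Remove exception handler; error is immediately re-raised") refers to handlers like the first function below. This is an illustrative sketch only; the function names are hypothetical and are not taken from the changed files.

```python
def parse_port_redundant(value):
    try:
        return int(value)
    except ValueError:
        # The handler does nothing except re-raise, so it adds no behavior.
        raise


def parse_port(value):
    # Same behavior without the handler: the ValueError propagates as-is.
    return int(value)
```

A handler that does some work before re-raising (logging, counting attempts, cleanup) is a different case from the bare re-raise shown here.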
@bherrin3
Contributor

  1. I was running AT-01 at full/over-full capacity and deployed another 200 or so VMs using this PR, which put AT-01 on life support.
  2. I had to commute, so I shut down my machine with an active broker session.
  3. I made it to my destination and had to reconnect to the VPN.
  4. All instances were hung in AT-01, and I had to give AT-01 CPR by killing A LOT of jobs, including these.
  5. Once the jobs were killed, my terminal noted and reported the job failures.

That seems pretty indicative that this works as intended.

@JacobCallahan JacobCallahan merged commit 81af39e into SatelliteQE:master Jan 15, 2025
4 checks passed
@JacobCallahan JacobCallahan deleted the yolo_checkout branch January 15, 2025 19:06